About the project

How are you feeling right now?

All of this is new and quite promising. I do not know anything about the subject yet, and learning will probably be challenging; however, I am eager to start. I experienced some difficulties in this first task, as I was unfamiliar with the whole thing, but in the end I succeeded in overcoming those problems.

What do you expect to learn?

I’d like to learn about processing large datasets for use in my research. My specialization is in political science and organizations, so I would like to learn mainly about working with statistics and with data visualization.

Where did you hear about this course?

My supervisor warmly recommended that I enrol in this course.

MY GITHUB REPOSITORY


Chapter 2: Regression and Model Validation.

2.1.Data Analysis.

2.1.1.Reading of the dataframe “learning2014” and exploration of its dimension and structure.

First of all, I read the new dataframe with the function read.table().

Then I use the function dim() to show the dimensions of the dataframe: 166 observations and 7 variables, as explained in the above-mentioned data wrangling section:

## [1] 166   7

Typing the function str() shows the structure of the dataframe:

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
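The steps above can be sketched as follows; the file path is an assumption, so substitute the actual location of the wrangled data:

```r
# read the wrangled data (the path is an assumption)
learning2014 <- read.table("data/learning2014.txt", sep = ",", header = TRUE)

dim(learning2014)   # dimensions: number of rows and columns
str(learning2014)   # structure: variable names and types
```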

2.1.2.Graphical overview of the data and summary of their variables.

To visualize the data, I use the function install.packages() to install the visualization packages ggplot2 and GGally. Then, with the function library(), I load them into the project.

install.packages("ggplot2")
install.packages("GGally")
library(ggplot2)
library(GGally)

Visualizing and Exploring the dataframe.

The libraries are opened:

## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Using the fast plotting function pairs(), we can draw a scatterplot matrix showing all the possible pairwise scatterplots of the dataframe's columns. Different colors are used for males and females.

This second plot matrix is more advanced, and it is made with ggpairs().
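Both plot matrices can be produced roughly as below; the aesthetic choices (alpha, histogram bins) are my assumptions:

```r
library(ggplot2)
library(GGally)

# basic scatterplot matrix of all columns except gender, coloured by gender
pairs(learning2014[-1], col = learning2014$gender)

# more advanced plot matrix with densities, histograms and correlations
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
```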

The summary of the variables:

##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Output interpretation and description: distribution of the variables and the relationships between them.

Females (110) are almost twice as numerous as males (56), while the males present a wider age range. The overview suggests notable correlations for surf vs. deep and Points vs. Attitude.

2.1.3.Multiple Regression Model.

I have chosen the three variables "attitude", "deep learning" and "surface learning", with "exam points" as the target variable, to fit a regression model.

Drawing a plot matrix with ggpairs().

Fitting the regression models with three explanatory variables and running the summary:
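The model in the summary below can be fitted as follows; the object name my_model is my own choice:

```r
# multiple regression: exam points explained by attitude and the
# two learning-approach scores
my_model <- lm(Points ~ Attitude + deep + surf, data = learning2014)
summary(my_model)
```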

## 
## Call:
## lm(formula = Points ~ Attitude + deep + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## Attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

Commentary and interpretation of the results.

MISSING

2.1.4.Explanation of the relationships between the chosen explanatory variables and the target variable, with the summary of the fitted model (interpreting the model parameters). Explanation and interpretation of the multiple R-squared of the model.

The adjusted R-squared of 0.1876 indicates a poorly fitting model: the explanatory variables account for less than 19% of the variance in exam points. Only Attitude presents statistical significance.

2.1.5.Diagnostic plots

Residuals vs Fitted values.

Normal QQ-plot.

Residuals vs Leverage.
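The three diagnostic plots listed above can be drawn directly from the fitted model object (my_model from the earlier step); which = c(1, 2, 5) selects exactly these plots:

```r
# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))
```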

Explanation of the model’s assumptions and interpretation of their validity on the basis of the diagnostic plots.

A multiple linear regression model has a few assumptions:
a linear relationship between the target variable and the explanatory variables, usually revealed by scatterplots;
multivariate normality, which means that the residuals are normally distributed; the QQ-plot can reveal this;
the absence of multicollinearity, in other words, the explanatory variables are not highly correlated with each other;
homoscedasticity, or constant variance of errors: the error terms have similar variance across the values of the explanatory variables. A plot of standardized residuals versus predicted values shows whether the points are equally distributed across all values of the dependent variable.
The diagnostic plots delivered the following observations:
The residuals vs. fitted values plot is used to check the assumption of a linear relationship. A horizontal line without distinct patterns indicates a linear relationship; in this case the red line is more or less horizontal at zero, so a linear relationship can be assumed.
The normal QQ-plot reveals whether the residuals are normally distributed. A good indication is the residual points following the straight dashed line. That is the case for the majority here, so normality can also be assumed.
The residuals vs. leverage plot identifies the impact of single observations on the model. Influential points lie at the upper or lower right corner, in a position where they pull against the regression line. In this case the points are on the left side of the plot, so we can say that no observation has excessive leverage.


Chapter 3: Logistic Regression.

3.2.Data Analysis.

3.2.2.Reading the dataframe “alc”, its description, and printing of the variables’ names.

In this chapter, we will analyze a dataset resulting from the data wrangling of another dataset about high-school students’ performance in mathematics and the Portuguese language. Hereinafter are the variables of the dataset we are going to analyze:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Altogether, the dataset has a dimension of:

## [1] 382  35

3.2.3.Choosing four interesting variables, and formulating hypotheses in relation to alcohol consumption.

I choose four variables of interest on which I build some hypotheses:

Variable chosen   Related hypothesis
"goout"           H1: Students who go out more have higher alcohol consumption.
"freetime"        H2: Students who have more free time are more prone to drink.
"studytime"       H3: The more the study time, the less a student drinks.
"romantic"        H4: Given the courting dynamics, romantics drink less.

3.2.4.Numerical and graphical exploration of the variables’ distributions in relation to alcohol consumption.

Let’s start with a graphical overview of the distributions of the dataset’s variables in relation to alcohol consumption. By installing and loading "tidyr", "dplyr", and "ggplot2", and then piping the data (previewed with glimpse()) via the %>% operator to the plot-generating function ggplot(), we get the following plot:
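A common way to produce such a faceted overview is sketched below; gather() is one option for the reshaping step (it has since been superseded by pivot_longer() in newer tidyr):

```r
library(tidyr); library(dplyr); library(ggplot2)

# reshape the data to long format and draw a bar plot of each variable
gather(alc) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar()
```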

Cross-tabulations.

Let’s now focus on the cross-tabulations for the specific interest variables on which hypotheses were posed:

Go out and Alcohol use.
## # A tibble: 38 x 4
## # Groups:   goout [5]
##    goout alc_use count mean_grade
##    <int>   <dbl> <int>      <dbl>
##  1     1     1      14       11.1
##  2     1     1.5     3        7  
##  3     1     2       2       13.5
##  4     1     2.5     2       12  
##  5     1     3.5     1       10  
##  6     2     1      49       12.6
##  7     2     1.5    21       12.1
##  8     2     2      14       10.3
##  9     2     2.5     8       12.4
## 10     2     3       4       10.8
## # ... with 28 more rows

Free time and Alcohol use.
## # A tibble: 37 x 4
## # Groups:   freetime [5]
##    freetime alc_use count mean_grade
##       <int>   <dbl> <int>      <dbl>
##  1        1     1      10       10.3
##  2        1     1.5     1       16  
##  3        1     2       4       12.2
##  4        1     3       1       10  
##  5        1     4       1        8  
##  6        2     1      18       13.3
##  7        2     1.5    24       12.9
##  8        2     2       7       10.3
##  9        2     2.5     8       11.8
## 10        2     3       6       10.7
## # ... with 27 more rows

Study time and Alcohol use.
## # A tibble: 30 x 4
## # Groups:   studytime [4]
##    studytime alc_use count mean_grade
##        <int>   <dbl> <int>      <dbl>
##  1         1     1      21      12.2 
##  2         1     1.5    20      10   
##  3         1     2      17      10.1 
##  4         1     2.5    10      11.5 
##  5         1     3      12      10.1 
##  6         1     3.5    11       9.09
##  7         1     4       4      13   
##  8         1     5       5       9.6 
##  9         2     1      72      11.7 
## 10         2     1.5    35      12.1 
## # ... with 20 more rows

Romantic and Alcohol use.
## # A tibble: 18 x 4
## # Groups:   romantic [2]
##    romantic alc_use count mean_grade
##    <fct>      <dbl> <int>      <dbl>
##  1 no           1      98      12.3 
##  2 no           1.5    47      11.8 
##  3 no           2      35      11.3 
##  4 no           2.5    32      11.8 
##  5 no           3      25      10.4 
##  6 no           3.5    13      11.1 
##  7 no           4       6      10.3 
##  8 no           4.5     1      10   
##  9 no           5       4      10   
## 10 yes          1      42      11.1 
## 11 yes          1.5    22      11.2 
## 12 yes          2      24      11.2 
## 13 yes          2.5    12      11.7 
## 14 yes          3       7       8.71
## 15 yes          3.5     4       7.5 
## 16 yes          4       3      10   
## 17 yes          4.5     2      11   
## 18 yes          5       5      11.6
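The cross-tabulations above can be reproduced with a dplyr group-and-summarise pattern; here G3 is assumed to be the grade variable behind mean_grade:

```r
library(dplyr)

# counts and mean final grade per combination of going out and alcohol use
alc %>%
  group_by(goout, alc_use) %>%
  summarise(count = n(), mean_grade = mean(G3))

# the same pattern applies to freetime, studytime, and romantic
```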

Bar plots.

Box plots.

Comments on findings and comparison of the results against the previous hypotheses.

Overall, for the chosen variables, only the females seem to present outliers. As for the whiskers, the females vary more than the males, except for the variable romantic. The only skewed plot is that of the romantic males. Let’s now proceed to the comparison with the hypotheses.

"goout" and H1.

Concerning the variable "goout", I hypothesized (H1) that students who go out more have higher alcohol consumption. Overall, it does seem that people who go out more consume more alcohol. The most striking differences are between classes 1 and 3, but people at level 2 have higher consumption than people at level 5, and the highest consumption is registered at 3. So it is not self-evident that the more a student goes out, the more they drink: hypothesis H1 is not entirely correct.

"freetime" and H2.

Regarding the variable "freetime", the related hypothesis (H2) was that students who have more free time are more prone to drink. Again, the levels for answers 3 and 4 are much higher than for 1, 2, and 5. At a general level, hypothesis H2 seems correct, but the distribution shows a stark decrease at 5, which has even lower levels than 2.

"studytime" and H3.

As for the variable "studytime", the findings seem to corroborate the hypothesis (H3) according to which the more a student studies, the less they drink. The value at 2 is higher than at 1, but consumption decreases as study time increases.

"romantic" and H4.

Finally, to the variable "romantic" I associated the hypothesis (H4) that romantics drink less than non-romantics, due to courting dynamics. The results seem to confirm H4: non-romantics drink as much as twice the amount romantics do, and the percentage of males is higher than that of females in this category, with a slightly larger gap among the non-romantics.

3.2.5.Logistic regression.

By using logistic regression we statistically explore the relationship between the four selected variables and the binary high/low alcohol consumption variable as the target variable.
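The model summarized below can be fitted as follows; the object name m is my own choice:

```r
# logistic regression of the binary high/low use target
# on the four chosen variables
m <- glm(high_use ~ goout + freetime + studytime + romantic,
         data = alc, family = "binomial")
summary(m)
```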

Summary of the fitted model and its interpretation.

## 
## Call:
## glm(formula = high_use ~ goout + freetime + studytime + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6927  -0.7743  -0.5479   1.0022   2.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.31564    0.62593  -3.700 0.000216 ***
## goout        0.72871    0.12071   6.037 1.57e-09 ***
## freetime     0.08366    0.13415   0.624 0.532863    
## studytime   -0.58629    0.16508  -3.552 0.000383 ***
## romanticyes -0.18301    0.26722  -0.685 0.493442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 399.98  on 377  degrees of freedom
## AIC: 409.98
## 
## Number of Fisher Scoring iterations: 4

Interpretation of the coefficients of the model as odds ratios, and provision of confidence intervals for them.
## (Intercept)       goout    freetime   studytime romanticyes 
## -2.31564309  0.72871166  0.08366382 -0.58628767 -0.18300551

The odds ratio (OR) is obtained by dividing the odds of “success” (Y = 1) for students who have property X by the odds of “success” for those who do not. As the OR quantifies the relationship between X and Y, an OR higher than 1 indicates that X is positively associated with “success”. The odds ratios can also be obtained by exponentiating the coefficients of a logistic regression model.

Computation of the odds ratio (OR)

Computation of the confidence intervals for the coefficients with the function confint(), and exponentiation of the values using exp().

## Waiting for profiling to be done...
##                  2.5 %     97.5 %
## (Intercept) -3.5691044 -1.1092870
## goout        0.4982923  0.9725916
## freetime    -0.1795761  0.3477794
## studytime   -0.9207772 -0.2716715
## romanticyes -0.7148612  0.3353712

Obtaining the odds ratios with their confidence intervals by using cbind():

##                     OR      2.5 %    97.5 %
## (Intercept) 0.09870269 0.02818108 0.3297940
## goout       2.07240892 1.64590816 2.6447898
## freetime    1.08726331 0.83562438 1.4159199
## studytime   0.55638895 0.39820941 0.7621045
## romanticyes 0.83276356 0.48926002 1.3984594
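Assuming the fitted model object m from the previous step, the table above comes from:

```r
# odds ratios are the exponentiated model coefficients
OR <- exp(coef(m))

# confidence intervals computed with confint(), moved to the OR scale
CI <- exp(confint(m))

# combine them side by side
cbind(OR, CI)
```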

Result interpretation and comparison with previously formulated hypotheses.

Values bigger than 1 appear throughout for goout, for freetime (except at the 2.5% bound), and at the 97.5% bound of romantic; there, the association is positive. These results mostly confirmed my hypotheses, apart from studytime.

3.2.6.Predictive power of the model.

First we use the function predict() to estimate the probability of high use. After adding the new columns to the dataset 'alc', we predict probabilities and classes, and tabulate the target variable against the predictions:
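A sketch of these steps, assuming the model object m fitted earlier and a 0.5 classification threshold:

```r
library(dplyr)

# add the predicted probability of high use to the dataset
alc <- mutate(alc, probability = predict(m, type = "response"))

# classify as high use when the probability exceeds 0.5
alc <- mutate(alc, prediction = probability > 0.5)

# look at the last ten observations
select(alc, goout, freetime, studytime, romantic, probability, prediction) %>%
  tail(10)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
```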

##     goout freetime studytime romantic probability prediction
## 373     2        3         1       no  0.23263069      FALSE
## 374     3        4         1       no  0.40585185      FALSE
## 375     3        3         3       no  0.16282192      FALSE
## 376     3        4         1      yes  0.36258869      FALSE
## 377     2        4         3       no  0.09258880      FALSE
## 378     4        3         2       no  0.42009575      FALSE
## 379     2        2         2       no  0.13429940      FALSE
## 380     1        1         2       no  0.06441395      FALSE
## 381     5        4         1       no  0.74578989       TRUE
## 382     1        4         1       no  0.13722123      FALSE

Cross-tabulations and a graphic of actual values vs. predictions.
##         prediction
## high_use FALSE TRUE
##    FALSE   247   21
##    TRUE     73   41

##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64659686 0.05497382 0.70157068
##    TRUE  0.19109948 0.10732984 0.29842932
##    Sum   0.83769634 0.16230366 1.00000000

Training error, result comments, and model performance vs. guessing.

Accuracy measures performance in binary classification as the average number of correctly classified observations. The mean of incorrectly classified observations can be seen as a penalty (loss) function of the classifier: the smaller, the better. In this section, we first define a loss function loss_func(), and then apply it to probability = 0, probability = 1, and finally to the prediction probabilities in alc.
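The loss function and the three calls can be sketched as follows (probability is the column added in the prediction step):

```r
# mean of incorrectly classified observations
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = 0)                # always guess "low"
loss_func(class = alc$high_use, prob = 1)                # always guess "high"
loss_func(class = alc$high_use, prob = alc$probability)  # the fitted model
```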

## [1] 0.2984293
## [1] 0.7015707
## [1] 0.2460733

The first and third calls deliver better results than the case of probability = 1; with a training error of about 0.246, the model works better than guessing.

3.2.7. Bonus: 10-fold model cross-validation.

Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent dataset. In cross-validation, a sample of data is partitioned into complementary subsets (a larger training set and a smaller testing set); the analysis is performed on the former and the results validated on the latter. Here, K = 10 subsets are used.
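A sketch of both variants with boot::cv.glm(), assuming the model m and the loss function loss_func from the previous sections:

```r
library(boot)

# 10-fold cross-validation, with the average misclassification rate as cost
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]

# leave-one-out cross-validation: K equal to the number of observations
cv_loo <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv_loo$delta[1]
```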

## [1] 0.2460733

With leave-one-out cross validation:

## [1] 0.2460733

with 10-fold cross validation:

## [1] 0.2539267

The ten-fold cross-validation shows a higher prediction error on the testing data than on the training data. It is also lower than the 0.26 in the DataCamp exercise.

3.2.8. Super Bonus: performance comparison via cross-validation (PLOTS MISSING).

At first I use a logistic regression model with 22 predictors.

## [1] 0.2460733

The function is performed with leave-one-out cross validation.

## [1] 0.2539267

Here the result is given by ten-fold cross validation.

## [1] 0.2539267

With 15 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2696335

Here the result is given by ten-fold cross validation.

## [1] 0.2617801

With 10 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2670157

Here the result is given by ten-fold cross validation.

## [1] 0.2748691

With 5 predictors

The function is performed with leave-one-out cross validation.

## [1] 0.2460733

Here the result is given by ten-fold cross validation.

## [1] 0.2460733

Chapter 4: Clustering and Classification.

4.2.Data Analysis.

4.2.2.Loading of the Boston data from the MASS Package. Description, structure and dimension.

The dataset Boston concerns housing values in the suburbs of the city of the same name. I use the functions str() and dim() to explore the dataset. Here is its structure:

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

And here its dimension:

## [1] 506  14
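The exploration above amounts to:

```r
library(MASS)

# the Boston data ships with the MASS package
data("Boston")
str(Boston)   # 506 observations of 14 variables
dim(Boston)
```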

4.2.3.Graphical overview of the data and variables’ summary.

Let’s have a look at the summary() of the variables:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Using the function pairs() we obtain the following graphical overview:

From the plot above, it is a bit difficult to see the relations between the variables. Let’s try something else, for instance a correlation plot: the function corrplot() gives us a visual way to look at correlations. First we need to calculate the correlation matrix using cor():

##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Now that we have the matrix, rounded to two digits, we can create the correlation plot with corrplot(). Here is how it looks:
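A sketch of the two steps; the display options other than order = "hclust" (mentioned below) are my assumptions:

```r
library(MASS)
library(corrplot)
data("Boston")

# correlation matrix rounded to two digits
cor_matrix <- round(cor(Boston), digits = 2)

# visualize it; hclust ordering groups correlated variables together
corrplot(cor_matrix, method = "circle", type = "upper",
         order = "hclust", tl.cex = 0.8)
```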

Description of the outputs and interpretation of the variables’ distributions and relations.

The corrplot() gives us a graphical overview of the Pearson correlation coefficients calculated with cor(). This measure quantifies the strength of a linear association between two variables: a value of 0 means the variables are uncorrelated, while a value of -1 (in red) or 1 (in blue) shows that they are perfectly related.
As we can see, the size and colour intensity of the dots visually show the strength of the linear associations. I used order = "hclust" as the ordering method for this correlation matrix, as it makes the matrix more immediate to read. Among the strongest negative correlations are: dis vs. nox, dis vs. indus, dis vs. age, lstat vs. rm, and lstat vs. medv. Among the strongest positive correlations we find: tax vs. rad, tax vs. indus, nox vs. indus, and nox vs. age. Overall, only the variable chas seems to have little to no correlation with the other variables.

4.2.4.Dataset standardization.

We scale the dataset by using the scale() function, then we can see the scaled variables with summary():

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The function scale() operated on the variables by subtracting the column means from the corresponding columns and dividing the differences by the standard deviations. It was possible to scale the whole dataset here because it contains only numerical values.

The class of the boston_scaled object is a:

## [1] "matrix"

so, to complete the procedure, we change the object into a data frame.
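The scaling and conversion can be sketched as:

```r
library(MASS)
data("Boston")

# center and scale all columns
boston_scaled <- scale(Boston)
summary(boston_scaled)

class(boston_scaled)   # scale() returns a matrix
boston_scaled <- as.data.frame(boston_scaled)
```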

Creation of the categorical variable of the crime rate (with quantiles as break points).

To create the categorical variable, we use the function cut() together with quantile() to get a factor variable, divided by quantiles, with four rates of crime:

## crime
##      low  med_low med_high     high 
##      127      126      126      127
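A sketch of the step, assuming the scaled data frame boston_scaled from above:

```r
# break points at the quantiles of the scaled crime rate
bins <- quantile(boston_scaled$crim)

# four-level factor from low to high crime rate
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# swap the numeric crim for the categorical crime
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
```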

Division of the dataset in train and test sets.

At first we use nrow() to count the number of rows in the dataset:

## [1] 506

then with ind <- sample() we randomly choose 80% of them to create the train set. From the remaining rows we create the test set.
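In code, the split looks roughly like this (the 80% share is stated above; the object names are mine):

```r
n <- nrow(boston_scaled)   # 506

# randomly pick 80% of the row indices for training; the rest form the test set
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```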

4.2.5. Linear Discriminant Analysis (LDA).

In this section we fit a linear discriminant analysis on the train set, using the categorical crime rate as the target variable and all the other variables as predictors. Here we can see the plot:
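The fit and the plot can be sketched as follows, assuming the train set from the previous step:

```r
library(MASS)

# LDA: categorical crime rate as target, everything else as predictor
lda.fit <- lda(crime ~ ., data = train)

# plot the observations in the space of the first two discriminants
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
```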

4.2.6.Class Prediction with LDA on the test data.

We will now run the LDA model on the test data; but before that, we save the crime categories from the test set and then remove the categorical crime variable from the test dataset.
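A sketch of these steps, assuming lda.fit and test from the earlier sections:

```r
# keep the true classes aside, then hide them from the test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

# predict the crime classes and cross-tabulate against the truth
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
```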

Here is the cross tabulation of the results with the crime categories from the test set:

##           predicted
## correct    low med_low med_high high
##   low       12       8        1    0
##   med_low    8      17        5    0
##   med_high   0       5       14    2
##   high       0       0        0   30

Comments on the Results.

MISSING

4.2.7.Distance Measuring and Clustering of the Boston dataset.

To measure the distances between the observations, we first standardize the dataset using data.Normalization() with type = "n1".

Then we compute the distances between observations with the function dist(), which by default uses the Euclidean distance, the most common distance measure; we also try the Manhattan method:

After that, we calculate and visualize the total within-cluster sum of squares, using set.seed(123) so that the random choice of cluster centres is reproducible, and setting the maximum number of clusters to 10.

Using the elbow method, I choose to go with three centers.

Then we run kmeans(); I divide the plot into four parts to improve clarity:

To be sure, I also try something different, for instance five centers.
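The distance and clustering steps above can be sketched as below; scale() is used here in place of data.Normalization(), which performs the same n1 standardization, and the subset of variables plotted is my choice:

```r
library(MASS)
data("Boston")
set.seed(123)

# standardized data and the two distance matrices
boston_std <- as.data.frame(scale(Boston))
dist_eu  <- dist(boston_std)                        # euclidean (the default)
dist_man <- dist(boston_std, method = "manhattan")

# total within-cluster sum of squares for 1 to 10 clusters
twcss <- sapply(1:10, function(k) kmeans(boston_std, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "TWCSS")

# three centres, chosen at the elbow; plot a subset of variables for clarity
km <- kmeans(boston_std, centers = 3)
pairs(boston_std[, 1:7], col = km$cluster)
```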

RESULTS INTERPRETATION.

MISSING

BONUS

K-means and LDA on the original Boston data.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Here, as before, is the procedure for k-means with more than two clusters:

again, we run kmeans():

And here is the LDA model; since the variable chas appeared to be constant within groups, I removed it:

RESULT INTERPRETATION.

The most influential variable as a linear separator of the clusters is tax.

SUPER-BONUS.

In this section we run the code on the scaled train dataset to produce two 3D plots.

In this first 3D plot, the colour is given by train$crime:

In this second 3D plot, the colour is defined by km$cluster:
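The two plots can be sketched with plotly as below; train and lda.fit are assumed from the LDA sections, the object name matrix_product is my choice, and the k-means here is run on the same train data so the colour vector matches the points:

```r
library(plotly)

# project the train observations onto the linear discriminants
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)

# k-means on the same observations, three centres assumed
km <- kmeans(model_predictors, centers = 3)

# first 3D plot: colour by the true crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = train$crime)

# second 3D plot: colour by the k-means clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = "scatter3d", mode = "markers", color = km$cluster)
```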